PURPOSE

This document tracks how I filtered the clusters identified in Illumina iSelect SNP chip data generated from the WSU hexaploid wheat program.

SETUP

To get the cluster results, I used the ‘Rmixmod’ package to fit a mixture model of multivariate Gaussian components. The output of this program is a large ‘list’:

* Each item in the list is a different marker.
* Each marker contains a sublist: the first object is the ‘best’ model output, and the second object is the marker ID.
* Some markers contain a third object, which holds the index of any missing samples for that marker in the original Genome Studio export tables (‘theta_frame’ and ‘r_frame’).
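As a toy sketch of this layout (the sublist element names below are my own labels, not anything produced by Rmixmod), the list can be navigated like so:

```r
# Toy stand-in for the 'mixmodout' list described above; the names and
# contents are illustrative assumptions, not real Rmixmod output
toy_mixmodout <- list(
  list(best_model = "model output placeholder",
       marker_id  = "D_contig22790_269"),
  list(best_model = "model output placeholder",
       marker_id  = "Kukri_c80964_72",
       missing_idx = c(12, 47))  # third object only when samples are missing
)

# The second object of each sublist is the marker ID
toy_mixmodout[[1]][[2]]
# A marker with a third object has missing samples recorded
length(toy_mixmodout[[2]]) == 3
```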

Reading in the raw data and loading libraries:

load('/home/mmcgowan/Dropbox/M.McGowanSharedFiles/Wheat_data/genotypeclustering/kamiak_clust/QAM_filter_basedata.RData')
library(ggplot2)
library(Rmixmod)
## Loading required package: Rcpp
## Rmixmod version 2.1.1 loaded
## R package of mixmodLib version 3.2.2
## 
## Condition of use
## ----------------
## Copyright (C)  MIXMOD Team - 2001-2013
## 
## MIXMOD is publicly available under the GPL license (see www.gnu.org/copyleft/gpl.html)
## You can redistribute it and/or modify it under the terms of the GPL-3 license.
## Please understand that there may still be bugs and errors. Use it at your own risk.
## We take no responsibility for any errors or omissions in this package or for any misfortune that may befall you or others as a result of its use.
## 
## Please report bugs at: http://www.mixmod.org/article.php3?id_article=23
## 
## More information on : www.mixmod.org
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
library(foreach)
library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:Rmixmod':
## 
##     ellipse
library(RColorBrewer)
library(gbm)
## Loading required package: survival
## Loading required package: lattice
## Loading required package: splines
## Loaded gbm 2.1.3
library(caret)
## 
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
## 
##     cluster
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# Loading all custom functions
functionlist <- list.files('/home/mmcgowan/Dropbox/M.McGowanSharedFiles/Wheat_data/genotypeclustering/kamiak_clust/FUNCTIONS', full.names = T)
for (i in functionlist) {
  source(i)
}

Because the dataset we are working with is large, I use parallel processing to speed up my scripts. This requires initializing a parallel backend.

# Calculate the # of cores
no_cores <- detectCores() - 2

# Registering parallel backend
registerDoParallel(cores = no_cores)

PRIMARY FILTERING

Before any markers can be categorized and investigated further, it is important to determine which clusters identified by Rmixmod are biologically plausible and which ones are more likely to be ‘noise’. My theory is that this ‘noise’ is due to many factors such as poor-quality DNA, contamination, technician/machine error, etc. While the term ‘noise’ is somewhat vague, many of these clusters are readily identifiable when looking at plot outputs. The assumption I made when choosing this clustering method is that some ‘noise’ effects are relatively small (stochastic pipette error, machine error, batch variance) and some are big (bad DNA, technician mistakes), but all are somewhat random. In contrast, the SNP effect is less random and is instead based on biochemistry. There is also a polyploid effect, where homeologs and homologs can alter the assay signal. Therefore, the final polar theta-r coordinate at which a sample falls is a combination of the SNP effect, the polyploid effect, small noise, and big noise.

My strategy for determining whether an identified cluster is a distribution based on biochemistry or noise is to use a machine learning algorithm to automatically determine whether to keep or remove a cluster. First, I will create a training dataset where I have manually classified all clusters in a relatively small but representative subset of markers. Then I will use this training dataset to teach a machine learning algorithm the best criteria to use to divide the clusters. After these criteria are determined, this model can then be used to classify the rest of the clusters.

Example Marker Cluster Plots

## Warning in chol.default(shape, pivot = TRUE): the matrix is either rank-
## deficient or indefinite

Note that this marker clearly has three major clusters, plus two ‘noisy’ clusters that look like outliers rather than parts of the major clusters.

This marker demonstrates how difficult this process is. There appears to be a 3-cluster SNP compressed on the right-hand side of the graph with two obvious ‘noise’ clusters. However, there is also an outlier cluster centered around (0.5, 2) that may actually be biologically real. To test this theory, I extracted the samples contained in this cluster:

##                                  theta        r cluster
## Bau/Far-b1-1.Theta           0.4219012 2.132922       6
## Bau/Far-b1-34.Theta          0.4515495 2.248773       6
## Bau/Far-b1-36.Theta          0.5638779 2.039139       6
## Bau/Far-b1-39.Theta          0.5059506 1.876826       6
## Bau/Far-b1-40.Theta          0.5714495 1.949252       6
## Bau/Far-b1-41.Theta          0.4110552 2.166358       6
## Bau/Far-b1-46.Theta          0.4516166 2.320468       6
## Bau/Far-b1-47.Theta          0.4453222 2.167106       6
## Bau/Far-b1-48.Theta          0.4647934 2.102957       6
## Bau/Far-b1-49.Theta          0.5487349 1.878951       6
## Bau/Far-b1-50.Theta          0.5338429 2.040202       6
## Bau/Far-b1-51.Theta          0.4563138 2.185028       6
## Bau/Far-b1-53.Theta          0.4787483 2.071924       6
## Bau/Far-b1-55.Theta          0.5247312 1.933218       6
## Bau/Far-b1-56.Theta          0.4369121 2.180591       6
## Bau/Far-b1-58.Theta          0.5487795 1.974176       6
## Bau/Far-b2-129.Theta         0.5316229 1.936603       6
## Bau/Far-b2-131.Theta         0.4649383 2.177950       6
## Bau/Far-b2-132.Theta         0.5181820 1.998634       6
## Bau/Far-b2-133.Theta         0.5451783 1.926294       6
## Bau/Far-b2-134.Theta         0.5410533 2.097090       6
## Bau/Far-b2-137.Theta         0.5560583 1.914551       6
## Bau/Far-b2-138.Theta         0.4261789 1.964600       6
## Bau/Far-b2-149.Theta         0.4757960 2.112176       6
## Bau/Far-b2-150.Theta         0.5169359 1.899830       6
## Bau/Far-b2-153.Theta         0.5155960 1.963779       6
## Bau/Far-b2-155.Theta         0.5452015 1.942958       6
## Bau/Far-b2-187.Theta         0.4375570 2.133880       6
## AHR-Bauermeister.Theta       0.5293083 2.002929       6
## Other-Nor/Phi-b2-9-A2.Theta  0.5503911 2.093022       6
## Other-Caleb-Tucson E03.Theta 0.5246291 2.021065       6
## Other-Caleb-Tucson C06.Theta 0.5489718 2.041049       6
## QAM2-J960678-5.Theta         0.5741200 1.813585       6
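For reference, that kind of per-cluster subset can be pulled out with plain data.frame indexing. A minimal sketch with made-up values (the real extraction uses the Rmixmod partition for this marker):

```r
# Toy per-sample table for one marker: theta/r coordinates plus the
# cluster assignment from the mixture model (all values are made up)
set.seed(1)
samples <- data.frame(
  theta   = runif(10, 0.4, 0.6),
  r       = runif(10, 1.8, 2.3),
  cluster = c(6, 1, 6, 2, 6, 1, 6, 2, 6, 1),
  row.names = paste0("sample-", 1:10)
)

# Keep only the samples assigned to the cluster of interest
samples[samples$cluster == 6, ]
```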

Creating some training data

Before I can do any machine learning, I need a set of training data. To build it, I want to sample a subset of markers that represents the spectrum of cluster variation present in my entire set of markers, so I sub-sampled markers binned by their pre-filter cluster count: 100 markers with 2 clusters, 100 with 3, 100 with 4, and 100 with 5+. I plotted each marker and manually classified each cluster as biological or noise, for a total of 400 markers (~1400 clusters). I also calculated several characteristics that I think will help:

* prop = the proportion of samples in the cluster
* xcoord = the x-coordinate center
* ycoord = the y-coordinate center
* var1 = the 1st variance component (calculated from the rotational covariance matrix)
* var2 = the 2nd variance component
* xvar = the theta-associated variance
* yvar = the R-associated variance
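The two rotational variance components could be obtained from a cluster's 2x2 theta/r covariance matrix with an eigen-decomposition. A minimal sketch with a toy covariance matrix (the actual extract.clust.stats code may compute these differently):

```r
# Toy theta/r covariance matrix for one cluster (values are made up)
covmat <- matrix(c(0.0020, 0.0005,
                   0.0005, 0.0010), nrow = 2, byrow = TRUE)

eig  <- eigen(covmat)
var1 <- eig$values[1]  # 1st (major-axis) variance component
var2 <- eig$values[2]  # 2nd (minor-axis) variance component
xvar <- covmat[1, 1]   # theta-associated variance
yvar <- covmat[2, 2]   # R-associated variance

# The rotational components partition the same total variance
c(var1 = var1, var2 = var2, xvar = xvar, yvar = yvar)
```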

# Extracting the number of raw, unfiltered clusters identified for each marker
clustnum <- data.frame(matrix(nrow = length(mixmodout), ncol = 2))
for (i in 1:length(mixmodout)) {
  clustnum[i,1] <- mixmodout[[i]][[2]]
  clustnum[i,2] <- mixmodout[[i]][[1]][1]
}


# Indexing marker category
clust_2 <- which(clustnum[,2] == 2)
clust_3 <- which(clustnum[,2] == 3)
clust_4 <- which(clustnum[,2] == 4)
clust_5 <- which(clustnum[,2] >= 5)

# Creating the sample dataset
set.seed(3)
clust_2_samp <- sample(clust_2, 100)
set.seed(3)
clust_3_samp <- sample(clust_3, 100)
set.seed(3)
clust_4_samp <- sample(clust_4, 100)
set.seed(3)
clust_5_samp <- sample(clust_5, 100)
training_index <- c(clust_2_samp, clust_3_samp, clust_4_samp, clust_5_samp)

# Extracting cluster information
training_data <- extract.clust.stats(training_index, theta_frame, r_frame, mixmodout)

# Plotting the first 100 markers
for (n in clust_2_samp) {
  marker_name <- mixmodout[[n]][[2]]
  file_name <- paste(which(clust_2_samp == n), '_', marker_name, '.png', sep = "")
  clustplot <- plot.raw.clust(n, theta_frame, r_frame, mixmodout)
  ggsave(file_name, plot = clustplot, device = 'png')
}

# Plotting markers 101:200
for (n in clust_3_samp) {
  marker_name <- mixmodout[[n]][[2]]
  file_name <- paste(which(clust_3_samp == n), '_', marker_name, '.png', sep = "")
  clustplot <- plot.raw.clust(n, theta_frame, r_frame, mixmodout)
  ggsave(file_name, plot = clustplot, device = 'png')
}

# Plotting markers 201:300
for (n in clust_4_samp) {
  marker_name <- mixmodout[[n]][[2]]
  file_name <- paste(which(clust_4_samp == n), '_', marker_name, '.png', sep = "")
  clustplot <- plot.raw.clust(n, theta_frame, r_frame, mixmodout)
  ggsave(file_name, plot = clustplot, device = 'png')
}

# Plotting markers 301:400
for (n in clust_5_samp) {
  marker_name <- mixmodout[[n]][[2]]
  file_name <- paste(which(clust_5_samp == n), '_', marker_name, '.png', sep = "")
  clustplot <- plot.raw.clust(n, theta_frame, r_frame, mixmodout)
  ggsave(file_name, plot = clustplot, device = 'png')
}

# These plots were then used to manually score each cluster as noise or real

Here’s what the training set looks like:

# Reading in the training data
data.raw <- read.table('/home/mmcgowan/Dropbox/M.McGowanSharedFiles/Wheat_data/genotypeclustering/kamiak_clust/plot_examples/training_data.csv', header = T, sep = '\t', stringsAsFactors = F)

data.raw[1:10,]
##    category                   marker cluster       prop    xcoord
## 1         1        D_contig22790_269       1 0.97991071 0.0916442
## 2         0        D_contig22790_269       2 0.02008929 0.2377851
## 3         1      RFL_Contig1416_3921       1 0.97842262 0.3663278
## 4         0      RFL_Contig1416_3921       2 0.02157738 0.3733695
## 5         1 Excalibur_rep_c71163_225       1 0.98511905 0.0179073
## 6         0 Excalibur_rep_c71163_225       2 0.01488095 0.1732606
## 7         1     Excalibur_c4834_2003       1 0.02008929 0.3574809
## 8         0     Excalibur_c4834_2003       2 0.97991071 0.5045603
## 9         1          Kukri_c80964_72       1 0.98288690 0.5906906
## 10        0          Kukri_c80964_72       2 0.01711310 0.4227292
##       ycoord        var1         var2         xvar        yvar
## 1  0.4664488 0.002187762 0.0003615288 3.836061e-04 0.002167622
## 2  0.2203040 0.045530162 0.0142274518 4.495702e-02 0.017098959
## 3  1.2846571 0.004393299 0.0003716721 3.719550e-04 0.004396643
## 4  0.7909119 0.271798304 0.0179255230 1.856572e-02 0.281505386
## 5  0.9159507 0.005038310 0.0000219567 2.197329e-05 0.005042118
## 6  0.4903193 0.184138557 0.0208900017 2.198948e-02 0.193830060
## 7  0.3099243 0.090752681 0.0222673287 6.052143e-02 0.056845508
## 8  0.5040303 0.003222514 0.0011285464 1.310501e-03 0.003043866
## 9  2.0697504 0.009369652 0.0006033352 6.242720e-04 0.009356271
## 10 1.1455948 0.956871809 0.0331559054 7.892750e-02 0.956101473

Training a model

Now I want to try using a boosted tree method to determine a classification strategy. I found an online guide useful for this: Gradient boosting model guide

# Creating a formatted data.frame for analysis using gbm
data <- data.frame(data.raw$marker, data.raw$cluster, data.raw$prop, data.raw$xcoord, data.raw$ycoord, data.raw$var1, data.raw$var2, data.raw$category)
names(data) <- c('marker', 'cluster', 'prop', 'xcoord', 'ycoord', 'var1', 'var2', 'category')

# Converting the classifier into a factor
data$category <- ifelse(data$category==1,'real','noise')
data$category <- as.factor(data$category)

# Splitting the data into a training and test set
set.seed(23)
sample <- sample(1:nrow(data), round(0.1 * nrow(data)))
data.train <- data[-sample,]
data.test <- data[sample,]

# Configuring caret to automatically control the resampling/testing of my data (10-fold cross-validation in this case)
objControl <- trainControl(method='cv', number=10, returnResamp='none', summaryFunction = twoClassSummary, classProbs = TRUE)

# Setting outcome and predictors
outcome_name <- 'category'
predictor_names <- c('prop', 'xcoord', 'ycoord', 'var1', 'var2')

# Running the model
model_gbm <- train(data.train[,predictor_names], data.train[,outcome_name],
                   method = 'gbm',
                   trControl=objControl,
                   metric = 'ROC',
                   preProc = c('center', 'scale')
                   )
## Loading required package: plyr
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2273            -nan     0.1000    0.0800
##      2        1.0947            -nan     0.1000    0.0653
##      3        0.9854            -nan     0.1000    0.0542
##      4        0.8906            -nan     0.1000    0.0458
##      5        0.8085            -nan     0.1000    0.0385
##      6        0.7413            -nan     0.1000    0.0325
##      7        0.6844            -nan     0.1000    0.0284
##      8        0.6336            -nan     0.1000    0.0250
##      9        0.5907            -nan     0.1000    0.0214
##     10        0.5531            -nan     0.1000    0.0172
##     20        0.3433            -nan     0.1000    0.0036
##     40        0.2362            -nan     0.1000    0.0001
##     50        0.2202            -nan     0.1000   -0.0002
# Reading in the boosted model (saved earlier to improve knit time)
model_gbm <- readRDS('/home/mmcgowan/Dropbox/M.McGowanSharedFiles/Wheat_data/genotypeclustering/kamiak_clust/model_gbm.RDS')

Evaluating the GBM model

Now that I have a model, I will investigate what the properties of this model are.

The first thing I will look at is what variables were most important for determination:

summary(model_gbm)

##           var    rel.inf
## prop     prop 91.7051565
## var2     var2  6.5368428
## ycoord ycoord  0.7219190
## var1     var1  0.6841521
## xcoord xcoord  0.3519296

It looks like the primary contributor is the proportion of samples in the cluster, which makes a lot of sense! Then the next contributor is the var2 component. The other components don’t seem to be helping much in classifying the clusters.

Now to test the accuracy of this model using the testing data I set aside.

predictions <- predict(object=model_gbm, data.test[,predictor_names], type = 'raw')
head(predictions)
## [1] noise noise noise noise real  noise
## Levels: noise real
print(postResample(pred = predictions, obs = as.factor(data.test[,outcome_name])))
##  Accuracy     Kappa 
## 0.9677419 0.9354436

That looks pretty good: 97% of the testing clusters I set aside were classified correctly!

Let’s look at which clusters were wrongly identified.

which(data.test$category != predictions)
## [1]  10  12  77 120 140
data.test[which(data.test$category != predictions),]
##                       marker cluster       prop     xcoord    ycoord
## 1536           BS00076596_51       1 0.06994048 0.65176497 0.7741730
## 1080 wsnp_Ex_c19371_28311667       4 0.05282738 0.58338765 0.7009184
## 1188           GENE-2906_249       2 0.03497024 1.00000000 0.7255342
## 1530    Excalibur_c17206_256       1 0.02678571 0.71990417 1.2360362
## 379    Excalibur_c115824_342       2 0.06622024 0.03714632 0.5056953
##             var1         var2 category
## 1536 0.040573725 1.810731e-02     real
## 1080 0.044422275 4.332317e-03    noise
## 1188 0.002180045 3.710452e-10    noise
## 1530 0.004698476 7.934533e-04     real
## 379  0.022747278 9.359500e-05    noise
x <- which(theta_frame$Name == 'BS00076596_51')
plot.raw.clust(x, theta_frame, r_frame, mixmodout)

x <- which(theta_frame$Name == 'wsnp_Ex_c19371_28311667')
plot.raw.clust(x, theta_frame, r_frame, mixmodout)

x <- which(theta_frame$Name == 'Excalibur_c115824_342')
plot.raw.clust(x, theta_frame, r_frame, mixmodout)

Looking at these plots, it makes a lot of sense why the center ‘real’ clusters in the first two plots were misclassified as ‘noise’. Despite there being a clear distribution in the middle, it is very debatable whether these clusters could be considered ‘real’.

The third example is more confusing. There seems to be a cluster nested within the primary one that has a very tight distribution but also contains a fair proportion of samples (7%). This seems to be more of a problem with the Rmixmod results than with the classification model. It may be possible to identify ‘real’ clusters that could be merged after filtering.
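A purely hypothetical sketch of that merging idea (nothing here is implemented in this pipeline; the function name and tolerance are my own): nearby ‘real’ cluster centers could be grouped by single-linkage clustering and collapsed.

```r
# Hypothetical helper: group cluster x-centers that sit within 'tol'
# of each other; clusters sharing a group id would then be merged
merge_close_clusters <- function(centers, tol = 0.05) {
  groups <- cutree(hclust(dist(centers), method = "single"), h = tol)
  groups
}

# Two tight centers (0.10, 0.12) fall in one group; 0.85 stays separate
merge_close_clusters(c(c1 = 0.10, c2 = 0.12, c3 = 0.85))
```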

Using the classification model to filter clusters.

Now that I have a reasonably accurate model for predicting whether a cluster is ‘real’ or ‘noise’, I will apply the model to every cluster.

# Extracting stats for all markers
all_clust_stats <- extract.clust.stats(indices = 1:length(mixmodout), theta_frame, r_frame, mixmodout)
# Reading in all_clust_stats (saved earlier to improve knit time)
all_clust_stats <- readRDS('/home/mmcgowan/Dropbox/M.McGowanSharedFiles/Wheat_data/genotypeclustering/kamiak_clust/all_clust_stats.RDS')

# Predicting whether a cluster is 'real' or 'noise'
all_clust_predict <- predict(object=model_gbm, all_clust_stats[,predictor_names], type = 'raw')

# Adding these predictions to the cluster stats frame, but coercing it into a logical
all_clust_stats$category <- all_clust_predict
all_clust_stats$cat_bin <- as.logical(as.numeric(all_clust_predict)-1)

# Calculating how many clusters remained for each marker after removing 'noise'
clustnum_filt <- vector(mode = 'integer', length = length(mixmodout))
for (i in 1:length(mixmodout)) {
  cat('\r', 'Processing marker ', i, ' out of ', length(mixmodout), sep = '')
  marker_name <- mixmodout[[i]][[2]]
  real_clust <- which(all_clust_stats$marker == marker_name & all_clust_stats$cat_bin)
  clustnum_filt[i] <- length(real_clust)
}

# Storing data in a viewable frame
clust_num_frame <- data.frame(matrix(NA, ncol = 3, nrow = nrow(clustnum)))
names(clust_num_frame) <- c('marker', 'prefilter', 'postfilter')
clust_num_frame[,1] <- theta_frame$Name
clust_num_frame[,2] <- clustnum[,2]
clust_num_frame[,3] <- clustnum_filt

Then I plot a distribution of the post-filter cluster count for each marker and output a table

qplot(clust_num_frame$postfilter, bins = 9) + ggtitle('Histogram of Post-Filter Cluster Count')

table(clust_num_frame$postfilter)
## 
##     0     1     2     3     4     5     6     7     8 
##   283 39340 17681 17453  4441  1639   571   167    12

Based on this table, roughly half of the ~81k markers are monomorphic. Also, I noticed that there were 283 markers that had zero clusters…weird. So I plotted a few of them:

zero_clust <- which(clust_num_frame$postfilter == 0)[1:10]
zero_clust
##  [1]  180  387 1642 2139 2973 3992 4093 5906 6548 6984
for (i in zero_clust) {
  print(plot.raw.clust(i, theta_frame, r_frame, mixmodout))
}

Hmm, for some reason the centers of these clusters are all at (0,0) and their variances are very large. Why did the cluster centers fail in this case? I’m not sure, so I will table this for later.

Nevertheless, if a majority of the 2- and 3-cluster markers are correct, I will still have a lot of useful markers (~35k). Let’s look at a subset of these markers.

two_clust <- which(clust_num_frame$postfilter == 2)[1:20]
for (i in two_clust) {
  marker_name <- mixmodout[[i]][[2]]
  file_name <- paste(marker_name, '.png', sep = "")
  clustplot <- plot.raw.clust(i, theta_frame, r_frame, mixmodout)
  ggsave(file_name, plot = clustplot, device = 'png')
}

three_clust <- which(clust_num_frame$postfilter == 3)[1:20]
for (i in three_clust) {
  marker_name <- mixmodout[[i]][[2]]
  file_name <- paste(marker_name, '.png', sep = "")
  clustplot <- plot.raw.clust(i, theta_frame, r_frame, mixmodout)
  ggsave(file_name, plot = clustplot, device = 'png')
}
two_clust <- which(clust_num_frame$postfilter == 2)[1:5]
for (i in two_clust) {
  print(plot.raw.clust(i, theta_frame, r_frame, mixmodout))
}

## Warning in chol.default(shape, pivot = TRUE): the matrix is either rank-
## deficient or indefinite

three_clust <- which(clust_num_frame$postfilter == 3)[1:5]
for (i in three_clust) {
  print(plot.raw.clust(i, theta_frame, r_frame, mixmodout))
}

## Warning in chol.default(shape, pivot = TRUE): the matrix is either rank-
## deficient or indefinite

OK, almost all of these markers seem to make sense. However, the 3-cluster markers fall into two situations: in some, the clusters are all roughly equidistant along theta, and in others they are not.

Filtering 3-cluster markers

Because I assume a single SNP has an additive effect on the theta readout, I want to test a second filter that restricts the 3-cluster markers to only those that follow the equidistant assumption.

three_clust <- which(clust_num_frame$postfilter == 3)
three_clust_names <- as.character(theta_frame$Name[three_clust])
three_clust_stats <- all_clust_stats[all_clust_stats$marker %in% three_clust_names & all_clust_stats$cat_bin == T,]
three_clust_ratio <- vector(mode = 'numeric', length = length(three_clust_names))
three_clust_mindist <- vector(mode = 'numeric', length = length(three_clust_names))

for (i in 1:length(three_clust_names)) {
  xcoord <- three_clust_stats$xcoord[three_clust_stats$marker == three_clust_names[i]]
  distances <- sort(dist(xcoord))
  ratio <- distances[1] / distances[2]
  three_clust_mindist[i] <- min(distances)
  three_clust_ratio[i] <- ratio
}

three_clust_dist_stats <- data.frame(three_clust_names, three_clust_mindist, three_clust_ratio, stringsAsFactors = F)
names(three_clust_dist_stats) <- c('names', 'mindist', 'ratio')

Plotting a histogram of the minimum cluster distance for 3 cluster markers

qplot(three_clust_dist_stats$mindist) + ggtitle('3 Cluster marker: Histogram of minimum x-distance')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Plotting a histogram of the ratio of the left-to-middle to the middle-to-right cluster distances

qplot(three_clust_dist_stats$ratio) + ggtitle('3 Cluster marker: Histogram of left to right cluster distance ratio')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Clearly there is a strong bimodal distribution for the minimum distance. My guess is that the low mode consists of markers where truncation at theta values of zero or one caused two clusters to be found very close together. A cutpoint between 0.025 and 0.05 should filter these out.

Also, there is another bimodal split between markers whose clusters are roughly equidistant (ratio ~1) and those that are not (ratio ~0). A cutpoint between 0.25 and 0.5 should take care of these.

three_clust_filter2_list <- three_clust_dist_stats$names[three_clust_dist_stats$mindist > 0.025 & three_clust_dist_stats$ratio > 0.375]

Now let’s look at a random set of 5.

set.seed(3)
sample_list <- sample(three_clust_filter2_list, 5)
sample_index <- which(theta_frame$Name %in% sample_list)
for (i in 1:length(sample_index)) {
  index <- sample_index[i]
  marker_name <- mixmodout[[index]][[2]]
  file_name <- paste(i, '_', marker_name, '.png', sep = "")
  clustplot <- plot.filter.clust(index, theta_frame, r_frame, mixmodout, all_clust_stats)
  ggsave(file_name, plot = clustplot, device = 'png')
}
for (i in 1:length(sample_index)) {
  index <- sample_index[i]
  marker_name <- mixmodout[[index]][[2]]
  clustplot <- plot.filter.clust(index, theta_frame, r_frame, mixmodout, all_clust_stats)
  print(clustplot)
}

These all look great! Just to be sure, let’s look at the most extreme ratio values that sit just above my cutpoint.

three_clust_filter2 <- three_clust_dist_stats[three_clust_dist_stats$names %in% three_clust_filter2_list,]
extreme_test <- three_clust_filter2[order(three_clust_filter2$ratio),][1:5,1]
sample_index <- which(theta_frame$Name %in% extreme_test)
for (i in 1:length(sample_index)) {
  index <- sample_index[i]
  marker_name <- mixmodout[[index]][[2]]
  clustplot <- plot.filter.clust(index, theta_frame, r_frame, mixmodout, all_clust_stats)
  print(clustplot)
}

Even the extreme examples still make sense from a clustering perspective. Considering that these are a minority of my three-cluster markers, I think my filters have worked in picking out three-cluster markers that match eyeball expectations for a bi-allelic marker with three genotypes (AA, AB, BB).

Filtering 2-cluster markers

Now I want to try filtering 2-cluster markers. Like before, I calculated the minimum distance between clusters (since there are only two, there is only one distance). Then I plotted a histogram of these distances.

two_clust <- which(clust_num_frame$postfilter == 2)
two_clust_names <- as.character(theta_frame$Name[two_clust])
two_clust_stats <- all_clust_stats[all_clust_stats$marker %in% two_clust_names & all_clust_stats$cat_bin == T,]
two_clust_mindist <- vector(mode = 'integer', length = length(two_clust_names))

for (i in 1:length(two_clust_names)) {
  xcoord <- two_clust_stats$xcoord[two_clust_stats$marker == two_clust_names[i]]
  # With only two clusters there is a single pairwise distance
  two_clust_mindist[i] <- min(dist(xcoord))
}

two_clust_dist_stats <- data.frame(two_clust_names, two_clust_mindist, stringsAsFactors = F)
names(two_clust_dist_stats) <- c('names', 'mindist')

qplot(two_clust_dist_stats$mindist) + ggtitle('Histogram of two-cluster marker x-coordinate center distance')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Weird: the distribution appears to be tetra-modal, with modes at roughly 0, 0.125, 0.35, and 0.9. The 0.9 mode makes the most sense for biallelic SNPs with no homeologs, since AA would sit at theta = 0 and BB at theta = 1. With homeolog interaction, however, 0.35 also makes some sense. Before making any more conjectures, I plotted a random sample from each mode.

cat1_names <- sample(two_clust_dist_stats$names[two_clust_dist_stats$mindist < 0.025], 5)
cat2_names <- sample(two_clust_dist_stats$names[two_clust_dist_stats$mindist > 0.05 & two_clust_dist_stats$mindist < 0.1], 5)
cat3_names <- sample(two_clust_dist_stats$names[two_clust_dist_stats$mindist > 0.3 & two_clust_dist_stats$mindist < 0.4], 5)
cat4_names <- sample(two_clust_dist_stats$names[two_clust_dist_stats$mindist > 0.8], 5)

cat1_index <- which(theta_frame$Name %in% cat1_names)
cat2_index <- which(theta_frame$Name %in% cat2_names)
cat3_index <- which(theta_frame$Name %in% cat3_names)
cat4_index <- which(theta_frame$Name %in% cat4_names)

5 Plots for markers with an x-distance < 0.025

## Warning: Removed 26 rows containing missing values (geom_path).

5 Plots for markers with an x-distance > 0.05 & < 0.1

for (i in 1:length(cat2_index)) {
  index <- cat2_index[i]
  marker_name <- mixmodout[[index]][[2]]
  clustplot <- plot.filter.clust(index, theta_frame, r_frame, mixmodout, all_clust_stats)
  print(clustplot)
}

5 Plots for markers with an x-distance > 0.3 & < 0.4

for (i in 1:length(cat3_index)) {
  index <- cat3_index[i]
  marker_name <- mixmodout[[index]][[2]]
  clustplot <- plot.filter.clust(index, theta_frame, r_frame, mixmodout, all_clust_stats)
  print(clustplot)
}

5 Plots for markers with an x-distance > 0.8

for (i in 1:length(cat4_index)) {
  index <- cat4_index[i]
  marker_name <- mixmodout[[index]][[2]]
  clustplot <- plot.filter.clust(index, theta_frame, r_frame, mixmodout, all_clust_stats)
  print(clustplot)
}

Based on these samples, all categories seem OK except the first. Some of these markers do seem to have a weak middle cluster that was flagged as noise, but I think the filtering is still acceptable.

Extracting filtered markers and calling genotypes

First, I need to extract a list of marker names that I want to keep. Then I subset the partition file to include only the markers in that list.

two_cluster_filter_list <- two_clust_dist_stats$names[two_clust_dist_stats$mindist > 0.0375]
total_clust_filter_list <- c(two_cluster_filter_list, three_clust_filter2_list)

Then I call the extract.genotype function. This function checks whether the marker is a 2- or 3-cluster marker and assigns genotypes according to the theta-value order of the clusters: from left to right, a 2-cluster marker is coded (0, 2) and a 3-cluster marker is coded (0, 1, 2).
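As a toy illustration of that coding rule (this is not the actual extract.genotype implementation), genotype codes can be assigned from the theta order of hypothetical cluster centers:

```r
# Toy version of the coding rule: rank cluster centers left-to-right
# along theta and assign (0,2) for 2 clusters or (0,1,2) for 3 clusters
call_genotypes <- function(theta_centers) {
  ord   <- order(theta_centers)
  codes <- if (length(theta_centers) == 2) c(0, 2) else c(0, 1, 2)
  setNames(codes, names(theta_centers)[ord])
}

call_genotypes(c(A = 0.9, B = 0.1))            # B -> 0, A -> 2
call_genotypes(c(A = 0.5, B = 0.05, C = 0.95)) # B -> 0, A -> 1, C -> 2
```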

genotype_matrix <- extract.genotype(total_clust_filter_list, theta_frame, r_frame, mixmodout, all_clust_stats)
row.names(genotype_matrix) <- names(theta_frame)[-1]

Subsetting QAM samples, fixing sample names, and saving for later

To figure out which samples are QAM, I checked the row names of my genotype matrix. After determining which ones were QAM, I subset them.
Next, I used an Excel file provided by Dr. Carter to create a name-conversion table. I visually confirmed that the original order matches my genotype frame, then replaced the genotype matrix row names with the corrected sample names from the conversion table. Finally, I saved the R object as an RDS file for further analysis.

# Subset only QAM samples
QAM_genotypes <- genotype_matrix[841:1320,]

# Read in the conversion table
sample_guide <- read.table('/home/mmcgowan/Dropbox/M.McGowanSharedFiles/Wheat_data/genotypeclustering/sampleconversion_guide.csv', sep = '\t', header = T)
# Replace the sample names
row.names(QAM_genotypes) <- sample_guide$Corrected
# Saving the genotype matrix (in binary format for easier reading)
saveRDS(QAM_genotypes, file = 'QAM_genotypes.RDS')